Abstract

This paper presents an evolutionary algorithm for modeling the arrival
dates of document streams, that is, time-stamped collections of
documents such as newscasts, e-mails, IRC conversations, scientific journal
archives and weblog postings. The algorithm assigns frequencies (number
of document arrivals per time unit) to time intervals so as to produce an
optimal fit to the data. The optimization is a trade-off between accurately
fitting the data and avoiding too many frequency changes; in this way the
analysis is able to find fits which ignore the noise. Classical dynamic
programming algorithms are limited by memory and efficiency requirements,
which can be a problem when dealing with long streams. This suggests
exploring alternative search methods which allow for some degree of
uncertainty in order to achieve tractability. Experiments have shown that the
designed evolutionary algorithm is able to reach the same solution quality as
those classical dynamic programming algorithms in a shorter time. We have
also explored different probabilistic models to optimize the fitting of the
date streams, and applied these algorithms to infer whether a new arrival
increases or decreases interest in the topic the document stream is about.

Keywords: Online text streams, topic detection, evolutionary algorithms,
dynamic topic tracking.


1 Introduction 

The analysis of information flow has become a critical task nowadays.
Constantly evolving websites, repositories of news, e-mails, chat logs, and
scientific papers are some clear examples of streams whose interpretation
highly depends on the sequence of occurrence of the documents that constitute
them. All these examples have a temporal dimension that has to be taken into
account when analysing their content.

One of the first attempts to model the dynamic component of a stream
of documents was the Topic Detection and Tracking (TDT) research project
[?], which, among other contributions, attempted to settle the different tasks
that can be distinguished in identifying topics in document streams. First of
all, a distinction is established between topic, i.e. a general concept, and event,
i.e. an occurrence of a topic at a particular time. Among the distinguished tasks
are the detection of new events, i.e. identifying the occurrence of a document
discussing a new event; the tracking of a detected event, i.e. identifying a
collection of documents about the same event; and the segmentation of large
streams of documents into substreams related to different topics. Different kinds
of techniques have been applied to perform these TDT tasks, which involve both
content similarity analysis and temporal analysis methods.

More recent papers have focused on identifying time segments in which the
appearance of documents of a particular topic can be considered relatively
stable. The frequency of occurrences can be very noisy, and this hinders the
identification of intervals of similar frequency. Different statistical
techniques [?, ?] have been applied to analyse temporal changes in document
streams.

Another approach is to focus on studying the rise and fall of frequencies to 
detect trends in streams by identifying changes in the frequencies along given 
periods of time. Charikar et al. [?] have proposed an algorithm for finding
the most frequent elements in a stream, which is also adapted to find elements 
whose frequencies change the most. 

All these papers are reviewed by Kleinberg [1], who also discusses different
approaches to the problem. 

In this work we are interested in identifying the trends in streams of
documents in a robust and efficient way. This is a task of great interest for
newsmakers and for society at large: when large amounts of data are available,
it is difficult to answer the question "What is everybody talking about?"
Our problem therefore has some common ground with one of those tackled in [7],
although in this case we focus on the relative change of frequencies of a
single, isolated topic, not of different topics as above. Instead of adopting
the approach of analysing the amount of change for fixed periods of time, we
model the flow in a segment of time for which enough data are available,
following Kleinberg's approach [1].

The model is as follows: let us assume a stream of N documents arriving 
along a period of time T, and a set of S frequencies, ranging between 0 and 
1. We have a double goal: on the one hand, we want to determine the most 
appropriate frequency for each interval between one arrival and the next one; on 
the other hand, we are interested in grouping together intervals with “similar” 
frequencies, i.e. relatively stable, in order to wash out the noise. 

Figure 1 shows different possible fitting curves for the frequencies of a
sequence of documents which have arrived at the instants of time marked on the
X axis. If the chosen probabilistic model does not penalize changes of state,
the optimum curve would be the one which assigns to each interval the most
appropriate frequency, which, in general, amounts to a change of state for each
arrival. In this case, the chosen fit for the arrival marked on the X axis in
Figure 1 would be the lowest one. However, we want to assign the same state to
consecutive intervals with similar frequency, thus ignoring the changes
produced by the randomness in the arrivals. This can be done by penalizing the
change of state in different manners.

Figure 1: Some possible fitting curves for a stream of documents (X axis: time
of document arrival).

Kleinberg [1], whose method has inspired this work, has developed a framework
which formalizes these ideas. He proposes a probabilistic automaton to
model the frequency of appearance of documents in different intervals. The
state of the automaton at a particular time determines the expected frequency
of document occurrences, while transitions between states are probabilistically
modeled [1]. A burst of documents for a topic can therefore be identified by the
period in which the automaton has stayed in a high-frequency state. Given a
stream of documents related to a particular topic, the Viterbi dynamic
programming algorithm [?] can be applied to determine the sequence of automaton
states which optimizes the measure defined by the chosen probabilistic model.
This algorithm was initially designed to find the most probable path in a Markov
chain which produces a given sequence of tags, and it is now widely applied to
a large range of problems [?].

The streams of document dates to model can be obtained from very different
environments, though in many cases they are long sequences of documents
appearing along unlimited periods of time, such as chat logs, e-mails and news
stories. Since a dynamic programming algorithm is limited by memory and
efficiency constraints as the length of the stream or the number of automaton
states grows, it makes sense to explore alternative search methods in which a
degree of uncertainty is allowed in order to achieve tractability. We have
designed an evolutionary algorithm to perform the search for the optimal
sequence of states according to the selected model. Evolutionary algorithms
(EAs) do not guarantee reaching the optimum solution, but they produce a
reasonably good approximation, according to the resources assigned (time and
memory). On the one hand, an approximate solution is always better than no
solution at all; on the other hand, one should bear in mind that the
probabilistic model used to penalize the state transitions and deal with noisy
events is just an approximation itself, which can be formulated in different
ways leading to different results, as the experiments presented later on will
show. What is really interesting for most applications, such as the detection
of trends in streams of documents, is the detection of significant changes of
state, i.e. of clear changes in the average intensity of document arrivals, and
not obtaining a very precise optimum numerical result for the cost function.
For these reasons, this work investigates the application of evolutionary
algorithms to the problem.

Besides using an alternative algorithm for fitting frequencies, in this paper 
we have also studied alternative penalization functions to the one proposed by 
Kleinberg [1] for modeling the change of state when new documents arrive. We
have built a number of artificial streams, for which we know the best fit, in order 
to be able to compare the results obtained with the different cost functions. 

The rest of the paper proceeds as follows: Section 2 describes the model
proposed by Kleinberg, which has inspired this work, and the different variants
we are evaluating; Section 3 describes the evolutionary algorithm
used to find the optimal fit of frequency assignments; Section 4 presents and
discusses the experimental results, and Section 5 draws the main conclusions
from this work and discusses future lines of research.


2 Models for the Problem 

Our intention in this paper is to devise a model which generates a sequence of
events and to set its parameters in such a way that the stream under study might
have been generated by this model with high probability. In order to model
the arrival of documents in a stream we use a finite state automaton
(FSA), as in Kleinberg's approach [1], inspired in turn by models for network
traffic in queuing theory [5] and by Hidden Markov Models [?]. According to
this approach, a source of traffic emits documents at a rate which depends on
the state of the FSA at a given point in time. A traffic burst begins with a
transition from a state of lower emission rate to one of higher rate.
Kleinberg chooses an exponential density function $f(x) = \alpha e^{-\alpha x}$,
$\alpha > 0$, to model the probability density of waiting times between arrivals.
In the simplest case, we can distinguish between only two different states,
with low and high frequency respectively. In state $q_0$ the automaton emits
documents at a low rate, which gives rise to a probability density for the
gaps $f_0(x) = \alpha_0 e^{-\alpha_0 x}$, while state $q_1$ has a higher emission
rate, which gives rise to a probability density for the gaps given by
$f_1(x) = \alpha_1 e^{-\alpha_1 x}$, with $\alpha_1 > \alpha_0$. Now the question is
how to model the probability $p$ with which the automaton changes its state
between the emission of two consecutive documents. It is assumed that $p$ is
independent of previous emissions and state transitions. Assuming this
probability distribution, we can compute the probability of a sequence
$\mathbf{q} = \{q_{i_1}, \ldots, q_{i_n}\}$ of state transitions conditioned on the
sequence $\mathbf{x} = \{x_1, \ldots, x_n\}$ of gaps observed between the $n + 1$
documents arrived in the stream. The state sequence which maximizes this
probability $P(\mathbf{q}|\mathbf{x})$ is the one which minimizes the cost function
$c(\mathbf{q}|\mathbf{x}) = -\ln P(\mathbf{q}|\mathbf{x})$. Applying this condition,
Kleinberg obtains the formula

$$c(\mathbf{q}|\mathbf{x}) = b \ln\left(\frac{1-p}{p}\right) + \sum_{t=1}^{n} -\ln f_{i_t}(x_t) \qquad (1)$$

where b is the number of state transitions made to emit the sequence, i.e. the
number of times in which $q(t) \neq q(t+1)$. In formula (1), we can observe that
the fewer the state transitions, the smaller the first term, while the better
the state sequence fits the observed sequence of gaps x, the smaller the second
term. Therefore, it is expected that the optimum sequence of states fits the
gap sequence well with as few changes of state as possible, the strength of
this inertia depending on the parameter p.
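
As an illustration of formula (1), the following minimal Python sketch evaluates the cost of a given two-state sequence. The function name `two_state_cost` and the example values are assumptions made for this illustration only; the count `b` is taken over consecutive positions of the sequence, as in the definition above.

```python
import math

def two_state_cost(states, gaps, alpha0, alpha1, p):
    """Cost c(q|x) of formula (1) for a two-state automaton (illustrative sketch).

    states: list of 0/1 state indices, one per gap.
    gaps:   list of waiting times between consecutive documents.
    """
    rates = (alpha0, alpha1)
    # b: number of positions where the state differs from the next one
    b = sum(1 for cur, nxt in zip(states, states[1:]) if cur != nxt)
    transition_term = b * math.log((1.0 - p) / p)
    # For f(x) = alpha * exp(-alpha * x), -ln f(x) = alpha * x - ln(alpha)
    fit_term = sum(rates[s] * x - math.log(rates[s]) for s, x in zip(states, gaps))
    return transition_term + fit_term

gaps = [10, 10, 1, 1, 10]
flat = two_state_cost([0, 0, 0, 0, 0], gaps, 0.1, 1.0, p=0.1)
burst = two_state_cost([0, 0, 1, 1, 0], gaps, 0.1, 1.0, p=0.1)
```

With these hypothetical values the flat sequence is cheaper when p = 0.1, whereas raising p to 0.4 lowers the transition penalty enough for the sequence with a burst to win, which is the trade-off described in the text.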

Afterwards, Kleinberg extends this simple two-state model to one with an
infinite number of states, providing a different state for each possible
intensity of emission. Kleinberg proposes an automaton with an initial state
$q_0$ whose corresponding density function $\alpha_0 e^{-\alpha_0 x}$ is assigned an
emission rate $\alpha_0 = n/T$, where n is the number of gaps between document
emissions and T is the total length of the considered period of time; i.e.,
$\alpha_0$ corresponds to a perfectly uniform event emission. For the remaining
states $q_i$, $i > 0$, the assigned emission rate is $\alpha_i = \alpha_0 s^i$, where
$s > 1$ is a scaling parameter, i.e. the smaller the gap, the greater the
intensity. Then, by analogy with the two-state model, the cost function which
the selected sequence of states must minimize is

$$c(\mathbf{q}|\mathbf{x}) = \sum_{t=0}^{n-1} \tau(i_t, i_{t+1}) + \sum_{t=1}^{n} -\ln f_{i_t}(x_t) \qquad (2)$$

where $\tau(i, j)$ represents the cost of a state transition from the state $q_i$
at a given time t to the state $q_j$ at time t + 1. Kleinberg, who considers that
the selection of $\tau(i, j)$ is very flexible, chooses it in such a way that the
cost of changing from a lower intensity state to a higher intensity one is
proportional to the number of states involved, while there is no cost for
changing from higher to lower intensity states. Specifically, the cost
associated with changing from state $q_i$ to $q_j$, where $j > i$, is defined as
$(j - i)\gamma \ln n$, $\gamma$ being a parameter of the model.
$\mathcal{A}^*_{s,\gamma}$ denotes the automaton of infinite states with parameters
$s$ and $\gamma$. The parameter $s$ controls the scale for the rate values of the
states, while $\gamma$, which Kleinberg sets to 1 in his experiments, controls the
resistance to changing state.

Kleinberg shows that computing an optimal state sequence in an automaton
$\mathcal{A}^*_{s,\gamma}$ with infinite states is equivalent to computing q in one
of its finite restrictions $\mathcal{A}^k_{s,\gamma}$, obtained by deleting from the
automaton all states but the first k of them. This result allows establishing
algorithms to compute the sequence of states of minimum cost. For this
purpose, Kleinberg adopts the standard dynamic programming algorithm used for
hidden Markov Models.
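
The dynamic programming search over a finite restriction of the automaton can be sketched as follows. This is a minimal Viterbi-style illustration of minimizing formula (2) with Kleinberg's transition cost, not the implementation used in the experiments; the function name `viterbi_states`, the default `s = 2.0` and the choice of starting in state $q_0$ are assumptions of this sketch.

```python
import math

def viterbi_states(gaps, n_states, s=2.0, gamma=1.0):
    """Minimum-cost state sequence for formula (2) over the first n_states states."""
    n = len(gaps)
    alpha0 = n / sum(gaps)                      # uniform rate n / T
    rates = [alpha0 * s ** i for i in range(n_states)]
    log_n = math.log(n)

    def tau(i, j):
        # cost only for moving to a higher-intensity state
        return (j - i) * gamma * log_n if j > i else 0.0

    # cost[j]: best cost of any sequence ending in state j (start in state 0)
    cost = [0.0] + [math.inf] * (n_states - 1)
    back = []                                    # one row of back-pointers per gap
    for x in gaps:
        new_cost, pointers = [], []
        for j in range(n_states):
            i_best = min(range(n_states), key=lambda i: cost[i] + tau(i, j))
            pointers.append(i_best)
            fit = rates[j] * x - math.log(rates[j])   # -ln f_j(x)
            new_cost.append(cost[i_best] + tau(i_best, j) + fit)
        cost = new_cost
        back.append(pointers)

    # trace the optimal path backwards from the cheapest final state
    j = min(range(n_states), key=lambda k: cost[k])
    best_cost = cost[j]
    path = [j]
    for pointers in reversed(back[1:]):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return path, best_cost

path, best = viterbi_states([10] * 5 + [1] * 10 + [10] * 5, n_states=3)
```

The table of back-pointers holds one entry per gap and per state, which illustrates why memory becomes a concern for long streams or many states.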

However, this penalty function is not the only possible one. We have studied 
other penalization functions. On the one hand, we have tested cost functions 
which also penalize the change from a state with high intensity to a state of 
low intensity. On the other hand, we have searched for measures which do not 
depend on the number of states. We have chosen the following set of measures: 


• $\tau_a$, defined as $(j - i)\gamma \ln n$ if $j > i$, and $0$ if $j \le i$: the one
proposed by Kleinberg, where n stands for the number of documents in the
stream.

• $\tau_b$, defined as $|j - i|\gamma \ln n$: similar to $\tau_a$, but now there is also
a cost for changing from a higher intensity state to a lower one.

• $\tau_c$, defined as $\log(j - i)\gamma$ if $j > i$, and $0$ if $j \le i$: another
penalization function with zero penalty for high-to-low intensity changes.

• $\tau_d$, defined as $\log(|j - i|)\gamma$: the counterpart of $\tau_c$ which also
penalizes high-to-low intensity changes.

• $\tau_e$, defined as $\sqrt{j - i}\,\gamma$ if $j > i$, and $0$ if $j \le i$, which
has zero penalty for high-to-low intensity changes.

• $\tau_f$, defined as $\sqrt{|j - i|}\,\gamma$: the counterpart of $\tau_e$.

• $\tau_g$, defined as $(j - i)\gamma / \ln E$ if $j > i$, and $0$ if $j \le i$,
introduced to avoid the dependency on the number of states. E is the total
number of automaton states.

• $\tau_h$, defined as $|j - i|\gamma / \ln E$: the counterpart of $\tau_g$ which
penalizes any change of state.
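
For reference, the eight measures can be written compactly in code. The sketch below follows the definitions listed above as far as they can be read from the text; the function name `make_penalties` and the dictionary keys are assumptions of this illustration, and a transition to the same state is taken to cost nothing.

```python
import math

def make_penalties(n_docs, n_states, gamma=1.0):
    """Return the transition costs tau_a ... tau_h as functions of (i, j)."""
    def up_only(shape):
        # penalize only changes towards a higher-intensity state
        return lambda i, j: shape(j - i) if j > i else 0.0

    def symmetric(shape):
        # penalize changes in either direction
        return lambda i, j: shape(abs(j - i)) if j != i else 0.0

    linear_n = lambda d: d * gamma * math.log(n_docs)      # scaled by ln n
    log_d = lambda d: math.log(d) * gamma
    sqrt_d = lambda d: math.sqrt(d) * gamma
    linear_e = lambda d: d * gamma / math.log(n_states)    # scaled by 1 / ln E

    return {
        "a": up_only(linear_n), "b": symmetric(linear_n),
        "c": up_only(log_d),    "d": symmetric(log_d),
        "e": up_only(sqrt_d),   "f": symmetric(sqrt_d),
        "g": up_only(linear_e), "h": symmetric(linear_e),
    }

penalties = make_penalties(n_docs=1000, n_states=10)
cost_up = penalties["g"](2, 5)     # three states up
cost_down = penalties["g"](5, 2)   # moving down is free for tau_g
```

Note that, as written, the logarithmic measures assign zero cost to a change of a single state, since log(1) = 0.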


The results obtained with the different functions have been compared in the
section devoted to the experiments (Section 4).


3 The Search Algorithm: Evolutionary Algorithm

Systems based on evolutionary algorithms [?] maintain a population of
potential solutions to the problem and apply some selection process based on
the quality or fitness of individuals, as natural selection does. The
population is renewed by replacing individuals with those obtained by applying
"genetic" operators to selected individuals. The most usual "genetic"
operators are crossover and mutation. Crossover obtains new individuals by
mixing, often in some problem-dependent way, two individuals, called parents.
Mutation produces a new individual by performing some kind of random change on
an individual. The production of new generations continues until resources are
exhausted or until some individual in the population is fit enough.
Evolutionary algorithms have proved very useful as search and optimization
methods, and have previously been applied to different issues of natural
language processing [?, ?], such as text categorization, inference of
context-free grammars, tagging and parsing.

In our case, individuals represent sequences of state transitions in the
automaton. The fitness of individuals is the cost function associated with the
sequence of state transitions, and depends on the chosen probabilistic model.

3.1 Individual representation 

Let E be the number of states chosen for the automaton. Let us assume a stream
of n + 1 documents arriving along a period of time T. Then, the individuals
of our evolutionary algorithm could be represented as the list of automaton
states corresponding to each arrival of a document. Accordingly, an individual
would be a list of n genes $g_i$, where $g_i \in \{0, \ldots, E\}$ is the state q in
which the automaton is after the arrival of document i.




    $q(t_1)$ | $q(t_2)$ | ... | $q(t_n)$


However, the arrival of a new document does not produce a state transition 
in many cases. Therefore, the sequence of transitions can be represented in a 
more compact manner. Thus, an individual is a variable length list, in which 
each position, or “gene”, represents the arrival of a subsequence of documents 
which do not lead to a change of state. Each gene is composed of an automaton 
state and of an identifier of the first and the last document in the subsequence. 
Therefore, the individuals of our evolutionary algorithm could be represented 
as the list of state transitions in the automaton caused by the arrival of the 
documents. 


    $g_1$: $q(t_1)$, $(t_1, t_{k_1})$
    $g_2$: $q(t_{k_1+1})$, $(t_{k_1+1}, t_{k_2})$
    ...
    $g_f$: $q(t_{k_{f-1}+1})$, $(t_{k_{f-1}+1}, t_n)$
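
The relation between the two representations can be sketched in a few lines. In this illustration a gene is a tuple `(state, first, last)` holding the automaton state and the indices of the first and last documents of the run; the tuple layout and the function names `compress` and `expand` are assumptions made for the example.

```python
from itertools import groupby

def compress(states):
    """Turn a per-document state list into variable-length genes (state, first, last)."""
    genes, start = [], 0
    for state, run in groupby(states):
        length = len(list(run))
        genes.append((state, start, start + length - 1))
        start += length
    return genes

def expand(genes):
    """Inverse of compress: recover one state per document."""
    states = []
    for state, first, last in genes:
        states.extend([state] * (last - first + 1))
    return states

genes = compress([0, 0, 1, 1, 1, 0])   # [(0, 0, 1), (1, 2, 4), (0, 5, 5)]
```

The compact form has one gene per run of equal states, so its length depends on the number of state changes rather than on the number of documents.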


3.2 The Fitness Function 

For the fitness function we take, quite naturally, the cost function which defines
the chosen statistical model. Thus the goal of our evolutionary algorithm is
to find the sequence of state transitions $\mathbf{q} = (q_{i_1}, \ldots, q_{i_n})$
which minimizes the function

$$c(\mathbf{q}|\mathbf{x}) = \sum_{t=0}^{n-1} \tau(i_t, i_{t+1}) + \sum_{t=1}^{n} -\ln\left(\alpha_{i_t} e^{-\alpha_{i_t} x_t}\right),$$

with the penalty cost $\tau(i, j)$ and the state parameter, $\alpha_i$, of the chosen
statistical model. In order to compute this function, the implicit automaton
underlying the model must be completely defined, i.e. we have to assign values
to each $\alpha_i$. We assume that the number of automaton states has previously
been fixed to E. Following Kleinberg's approach, we establish a uniform state
$q_0$, with a document arrival rate $\alpha_0 = n/T$, which corresponds to uniform
document arrivals. For the remaining states, $q_i$, $i > 0$, the arrival rate is
$\alpha_i = \alpha_0 s^i$, where $s > 1$ is a scaling parameter, i.e. the arrival
intensity increases geometrically with i.

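If we knew the value of the rate $\alpha_i$ of a particular state $q_i$, $i > 0$,
we could obtain the value of $s$ from $\alpha_i = \alpha_0 s^i$ with $\alpha_0 = n/T$:

$$s = \left(\frac{\alpha_i}{\alpha_0}\right)^{1/i} = \exp\{(\ln \alpha_i - \ln n + \ln T)/i\}$$

By assigning a particular value, such as 1, to the maximum arrival rate in the
automaton, $\alpha_E$, we obtain a value for $s$:

$$s = \exp\{(\ln \alpha_E - \ln n + \ln T)/E\}$$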

Depending on the form of the selected cost function $\tau(i, j)$ for a state
transition from state $q_i$ to $q_j$, further parameters must be fixed. For
example, if we use the cost function $\tau_a$ of Section 2, the one used by
Kleinberg, we must give a value to the parameter $\gamma$, on which the inertia of
the automaton to change its state depends. In his experiments, Kleinberg chose
$\gamma = 1$, as we also do.
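
The way the automaton rates follow from n, T, E and the chosen maximum rate can be sketched as below. The function name `state_rates` and its argument names are assumptions of this illustration; the maximum rate must exceed n/T for the resulting s to be greater than 1.

```python
import math

def state_rates(n_gaps, total_time, max_state, max_rate=1.0):
    """Scaling parameter s and rates alpha_0 ... alpha_E of the automaton.

    alpha_0 = n / T, and s is fixed by requiring alpha_E = max_rate.
    """
    alpha0 = n_gaps / total_time
    s = math.exp((math.log(max_rate) - math.log(n_gaps) + math.log(total_time)) / max_state)
    rates = [alpha0 * s ** i for i in range(max_state + 1)]
    return s, rates

# Hypothetical example: 100 gaps over 1000 time units, states q_0 ... q_2
s, rates = state_rates(100, 1000.0, 2)
```

In this example $\alpha_0 = 0.1$ and the highest state is pinned to a rate of 1, so s is the square root of 10.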

3.3 Initial Population 

The first step of an evolutionary algorithm is the creation of a population of 
individuals, or potential solutions to the problem. In our case, these individuals 
represent sequences of state transitions randomly generated. The simplest way 
of creating one such sequence is to choose a few documents at random and use 
them to split the whole document stream into intervals, each of which is assigned 
a random state. However, some preliminary experiments we have performed
have shown that such a simple strategy gives rise to a search space that is
too large for the algorithm to be efficient. Accordingly, we choose as
candidates for a state transition only those documents for which the gaps to
the previous document and to the following one are sufficiently different (one
of them at least 50% longer than the other). The interval before the first
transition is assigned a random state, and this state is increased or decreased 
in successive intervals according to whether the gaps on the left and right of 
the partition points increase or decrease in length, respectively. The size of this 
change is randomly chosen. 
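
One possible reading of this initialization is sketched below. It is an illustration under stated assumptions rather than the procedure actually used: genes are tuples `(state, first, last)` over gap indices, the number of partition points is drawn uniformly, and, in line with the split mutation of Section 3.5, a shorter gap on the right of a partition point is assumed to raise the state. The name `random_individual` and the parameter `ratio` are hypothetical.

```python
import random

def random_individual(gaps, max_state, rng, ratio=1.5):
    """Build one random individual as a list of genes (state, first, last)."""
    # candidate partition points: adjacent gaps differing by at least 50%
    cuts = [k for k in range(1, len(gaps))
            if max(gaps[k - 1], gaps[k]) >= ratio * min(gaps[k - 1], gaps[k])]
    chosen = sorted(rng.sample(cuts, rng.randint(0, len(cuts))))

    state = rng.randint(0, max_state)       # random state before the first cut
    genes, first = [], 0
    for k in chosen:
        genes.append((state, first, k - 1))
        step = rng.randint(1, max_state)    # size of the change, chosen at random
        if gaps[k] < gaps[k - 1]:           # gaps get shorter: higher intensity
            state = min(max_state, state + step)
        else:                               # gaps get longer: lower intensity
            state = max(0, state - step)
        first = k
    genes.append((state, first, len(gaps) - 1))
    return genes

individual = random_individual([10, 10, 1, 1, 10, 10], max_state=4, rng=random.Random(0))
```

Because states are clamped to the valid range, two consecutive genes may occasionally share a state; a later join would merge them.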

3.4 Crossover Operator 

We have implemented the classic one-point crossover, which creates two offspring
by combining two individuals in such a way that the first part of one parent up
to a crossover point is combined with the second part of the other parent and
vice versa. Afterwards, the best offspring substitutes the worst parent. This is
a steady-state, elitist strategy.


The steps to apply this operator are the following: 

• In order to select the crossover point, we randomly select one of the dates
of document arrival in the stream. Then, we search both parents for the
gene which contains this document.

• Then, the genes on the right-hand side of the selected gene in both parents
are exchanged.

• For the gene containing the crossover point, we must decide whether the
substream of documents (which, in general, is different in both parents)
is going to be joined or split. Experiments have shown that taking this
decision at random produces bad results. Accordingly, if the gaps to the
left and to the right of the crossover point are comparable, the substream
is assigned a single gene whose state is randomly selected from one of the
parents. Otherwise, the substream is split at the crossover point into two
genes, each taking the state from one parent.
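
The steps above can be sketched as follows for one of the two offspring (the other is obtained by swapping the parents). This is a hedged illustration: genes are tuples `(state, first, last)` over gap indices, "comparable" is assumed to mean that the longer gap is less than 50% longer than the shorter one, and the names `gene_index`, `crossover` and `ratio` are hypothetical.

```python
import random

def gene_index(genes, doc):
    """Index of the gene whose interval contains position doc."""
    for k, (_, first, last) in enumerate(genes):
        if first <= doc <= last:
            return k
    raise ValueError("document index outside the individual")

def crossover(parent_a, parent_b, gaps, point, rng, ratio=1.5):
    """Child with the left part of parent_a and the right part of parent_b.

    point must satisfy 1 <= point < len(gaps).
    """
    ka, kb = gene_index(parent_a, point), gene_index(parent_b, point)
    state_a, first_a, _ = parent_a[ka]
    state_b, _, last_b = parent_b[kb]
    left_gap, right_gap = gaps[point - 1], gaps[point]
    comparable = max(left_gap, right_gap) < ratio * min(left_gap, right_gap)
    if comparable:
        # join: one gene covering the whole substream, state from either parent
        middle = [(rng.choice((state_a, state_b)), first_a, last_b)]
    else:
        # split at the crossover point, each side keeping its parent's state
        middle = [(state_b, point, last_b)]
        if first_a < point:
            middle.insert(0, (state_a, first_a, point - 1))
    return parent_a[:ka] + middle + parent_b[kb + 1:]

gaps = [10, 10, 10, 1, 1, 1]
a = [(0, 0, 2), (2, 3, 5)]
b = [(1, 0, 3), (0, 4, 5)]
child = crossover(a, b, gaps, point=3, rng=random.Random(0))
```

Here the gaps around position 3 differ strongly, so the substream is split and the child is `[(0, 0, 2), (1, 3, 3), (0, 4, 5)]`.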

3.5 Mutation Operator 

The mutation operator is applied to every individual of the population with 
a probability given by the mutation rate. Different variants of mutation have 
been implemented, selecting at random the one to apply in each case: 

• One of these mutation operators amounts to choosing a gene at random
and randomly incrementing or decrementing its state by one unit.

• Another mutation operator joins two consecutive genes to produce a single
one. The state of the new gene is randomly taken from one of the original
genes.

• The last mutation operator splits a gene in two; each part is assigned a
different state: one of them is given the state of the original gene and the
other one is given the previous state plus or minus one (plus if the gap
on the left of the partition point is longer than the one on the right, and
minus otherwise). This operator is only applied if the gaps on both sides
of the partition are different.
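
A minimal sketch of the three operators is given below, again over genes of the form `(state, first, last)`. Two points are assumptions of this illustration and not stated in the text: states are clamped to the range 0..max_state, and in the split operator it is the right-hand part that receives the modified state. All function names are hypothetical.

```python
import random

def mutate_shift(genes, max_state, rng):
    """Increment or decrement the state of one random gene by one unit."""
    k = rng.randrange(len(genes))
    state, first, last = genes[k]
    new_state = min(max_state, max(0, state + rng.choice((-1, 1))))
    return genes[:k] + [(new_state, first, last)] + genes[k + 1:]

def mutate_join(genes, rng):
    """Merge two consecutive genes; the state comes from one of them."""
    if len(genes) < 2:
        return list(genes)
    k = rng.randrange(len(genes) - 1)
    state = rng.choice((genes[k][0], genes[k + 1][0]))
    return genes[:k] + [(state, genes[k][1], genes[k + 1][2])] + genes[k + 2:]

def mutate_split(genes, gaps, max_state, rng):
    """Split one gene in two at a random partition point."""
    candidates = [k for k, (_, first, last) in enumerate(genes) if last > first]
    if not candidates:
        return list(genes)
    k = rng.choice(candidates)
    state, first, last = genes[k]
    cut = rng.randint(first + 1, last)          # first position of the right part
    if gaps[cut - 1] == gaps[cut]:              # only applied if the gaps differ
        return list(genes)
    delta = 1 if gaps[cut - 1] > gaps[cut] else -1
    new_state = min(max_state, max(0, state + delta))
    return genes[:k] + [(state, first, cut - 1), (new_state, cut, last)] + genes[k + 1:]

def mutate(genes, gaps, max_state, rng):
    """Apply one of the three variants, selected at random."""
    variant = rng.choice(("shift", "join", "split"))
    if variant == "shift":
        return mutate_shift(genes, max_state, rng)
    if variant == "join":
        return mutate_join(genes, rng)
    return mutate_split(genes, gaps, max_state, rng)
```

For example, splitting the single gene `(1, 0, 1)` over the gaps `[10, 1]` yields `[(1, 0, 0), (2, 1, 1)]`, since the gap on the left of the partition point is the longer one.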

4 Experimental Results 

The first step in our experiments has been to investigate which cost function
works better. To do this, we have generated artificial streams with different
features, for which we know the probability distribution of the document
arrivals. The results of an EA are usually sensitive to settings such as the
population size, the number of generations and the rates of application of the
crossover and mutation operators. Because of this, we have investigated the
range of values for these parameters which provide the best results. We have
also compared the fits obtained with a dynamic programming algorithm with
those obtained with the EA. Finally, we have also tested our system on some
real-world streams.

Figure 2: Artificial stream generated to evaluate cost functions.


4.1 Evaluating Cost Functions 

In principle, it is not a trivial task to compare the results obtained using different
cost functions. It is pointless to compare the numerical value provided by each
of them for the cost, because, in general, they are going to be different even for
the same fitting curve. The best way to perform the comparison is to apply the
different measures to a stream for which we know the emission frequencies that
generated it. In this case, we can compare the correct fitting curve with the
curves obtained with the different cost functions. Because real-world streams
present too much noise to intuitively determine the best fit, we have applied
this method to two kinds of artificial streams, composed of a number of
intervals spanning the total time-length. These intervals have been designed by
hand to present different degrees of difficulty for the algorithm. The program
which generates streams of the first kind reads from an input file the number
of documents arriving within each interval and the gap between every two
consecutive ones, which is constant along the interval.
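
A generator for streams of this first kind might look like the sketch below. The input is given directly as a list of `(number of documents, gap)` pairs instead of being read from a file, and the name `stepped_stream` is an assumption of this illustration.

```python
def stepped_stream(intervals, start=0.0):
    """Arrival dates for a stream made of intervals with a constant gap.

    intervals: list of (n_docs, gap) pairs, one per interval.
    """
    dates, t = [], start
    for n_docs, gap in intervals:
        for _ in range(n_docs):
            t += gap
            dates.append(t)
    return dates

# Three documents two time units apart, then two documents half a unit apart
dates = stepped_stream([(3, 2.0), (2, 0.5)])   # [2.0, 4.0, 6.0, 6.5, 7.0]
```

Each interval appears as a flat step in the frequency-time representation, with height equal to the inverse of its gap.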

Intervals correspond to steps in the frequency-time representation. Figure 2
represents a stream of this type, which is composed of 220000 dates and presents
ascending and descending steps.

Figure 3 shows the fitting curves obtained by applying the Viterbi algorithm
with the different cost functions to the stream of Figure 2. We can observe that
the best fits, i.e. those closest to the base curve of frequencies used for
generating the stream, are obtained with functions $\tau_e$ and $\tau_g$. Notice
that these functions are able to detect the narrow peak in the stream (marked
with an arrow in the figure). However, we choose $\tau_g$ because it is normalized
with respect to the number of states, which makes its behavior less sensitive
to this parameter. Measures $\tau_b$, $\tau_d$, $\tau_f$ and $\tau_h$, in which changes
from a higher frequency state to a lower one are also penalized, produce
fitting curves that are much too flat.

Figure 3: Fitting curves obtained with different cost functions for the stream of
Figure 2. Clockwise from the top-left corner, $\tau_a$ and $\tau_b$, $\tau_c$ and
$\tau_d$, $\tau_g$ and $\tau_h$, $\tau_e$ and $\tau_f$.

We have also used for the evaluation another kind of stream, with random
gaps. To generate it, at every time unit we decide whether a document arrives
or not with a given probability (its frequency). Figure 4 shows the fits
obtained by applying the Viterbi algorithm to an automaton of 10 states for a
stream generated according to the probability distribution of Table 1.

The first row of this table indicates the initial date of each segment, the
second one the last date, and the third one the average frequency at which
documents are emitted. Gaps between documents in a segment are thus random.
In Figure 4 the most accurate and smoothest fit is again obtained with the cost
function r g . This function and r Q are the ones which most accurately reproduce 
the frequency changes. However, $\tau_g$ achieves a better fit of the frequencies.
In this case, cost functions $\tau_b$, $\tau_d$, $\tau_f$ and $\tau_h$ are too sensitive
to the noise, overfitting the arrival time curve.


Ini      0        1001     2001     3001     4001     5001
End      1000     2000     3000     4000     5000     6000
Freq.    0.002    0.89     0.004    0.9      0.001    0.99


Table 1: Probability distribution used to generate a stream with random gaps. 
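
The generation of a stream with random gaps from such a table can be sketched as follows. At every time unit of a segment a document is emitted with the probability given by the segment's frequency; the function name `random_gap_stream` and the tuple layout `(first, last, freq)` are assumptions of this illustration.

```python
import random

def random_gap_stream(segments, rng):
    """Arrival dates drawn independently at every time unit of each segment.

    segments: list of (first, last, freq) with inclusive integer dates.
    """
    dates = []
    for first, last, freq in segments:
        for t in range(first, last + 1):
            if rng.random() < freq:      # a document arrives with probability freq
                dates.append(t)
    return dates

# Segments taken from Table 1
table1 = [(0, 1000, 0.002), (1001, 2000, 0.89), (2001, 3000, 0.004),
          (3001, 4000, 0.9), (4001, 5000, 0.001), (5001, 6000, 0.99)]
dates = random_gap_stream(table1, random.Random(1))
```

The gaps inside each segment are random, while their average reflects the frequency assigned to the segment.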


As a conclusion, we have used $\tau_g$ instead of the cost function proposed by
Kleinberg in his paper, since it provides a better fit for known document streams.
This function avoids the dependency on the number of states that was
featured by Kleinberg's proposed one; the main difference is that while
Kleinberg's function ($\tau_a$ here) uses the number of documents as a scale
factor, ours ($\tau_g$ here) scales by the number of states. As can be seen in
Figures 3 and 4, the difference is bigger in the case of a randomly generated
stream than in the case of a stream with uniform gaps (Figure 3), which
probably means that, in the general case, $\tau_g$ will yield better results.

4.2 Evolutionary Algorithm Parameters 

In EAs there is always a trade-off between two fundamental factors in the
evolutionary process: population diversity and selective pressure. An increase
in selective pressure will induce a decrease in population diversity, leading
to a premature convergence of the EA, while a weak selective pressure can make
the search ineffective. The relation between these factors is mainly determined
by the population size and the rates of application of crossover and mutation.
A population that is too small lacks the required diversity, and the algorithm
usually converges to a bad result. However, populations that are much too large
require a long time to converge to a solution.

Figure 4: Fitting curves obtained with different cost functions for the stream
generated with the probability distribution appearing in Table 1.

Crossover and mutation rates, also critical for the effectiveness of the
algorithm, must be correlated with the population size to provide the EA with a
suitable population diversity. The most appropriate values for these parameters
are strongly problem-dependent. We have used the artificial streams (a) and (b)
depicted in Figure 5 and the stream of Figure 2 from the previous subsection,
referred to as stream (c), to study the best parameter settings for the EA. The
first one (a) is characterized by ascending steps. Its length is 220000 dates
of document arrivals. Stream (b) presents both ascending and descending steps
and has a length of 129000 dates. Stream (c) also presents both kinds of steps,
but in this case the lengths of the steps are very different.

Figure 5: Additional artificial streams, (a) and (b), generated to evaluate the EA.

Figures 6(a) and 6(b) show the number of iterations required for the EA to
reach convergence for the streams (a) and (b) of Figure 5 and stream (c) from
Figure 2 with different rates of application of crossover and mutation,
respectively. Convergence is achieved if the difference between the average
fitness values of successive generations lies below a threshold for a number
of generations.
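
The convergence test just described can be sketched as a small predicate over the history of average fitness values. The names `has_converged`, `threshold` and `patience` are assumptions of this illustration; the text does not state the threshold or the number of generations actually used.

```python
def has_converged(avg_fitness_history, threshold, patience):
    """True if the average fitness changed by less than threshold
    in each of the last `patience` generation-to-generation steps."""
    if len(avg_fitness_history) <= patience:
        return False
    recent = avg_fitness_history[-(patience + 1):]
    return all(abs(b - a) < threshold for a, b in zip(recent, recent[1:]))

history = [10.0, 5.0, 4.9995, 4.9994, 4.9994]
done = has_converged(history, threshold=0.01, patience=3)   # True
```

A check of this kind would be evaluated once per generation, stopping the EA as soon as it returns true or the maximum number of iterations is reached.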

We can observe that intermediate values of these parameters, such as a
crossover rate of 40% and a mutation rate of 10%, are enough to quickly reach
convergence, which makes the EA perform efficiently.

Figure 7(a) presents the number of iterations required to reach the optimum
value for the stream depicted in Figure fusing different population sizes. We 
can observe that a population size of 200 individuals is enough. Larger sizes,
although also valid, increase the execution time. Figure 7(b) shows the value
of the cost function as the number of iterations increases. This chart shows that
convergence can be reached very quickly.

Figure 6: Number of generations required to reach convergence (as defined in
the text) with different rates of crossover (a) and mutation (b). Results
correspond to the streams in Figure 5 (a and b) and to the stream of Figure 2 (c).


4.3 Comparing the EA with a classic algorithm 

To test the advantages of using an EA for the considered problem, we have
compared its results with those obtained with the Viterbi dynamic programming
algorithm. Tables 2, 3 and 4 show the values obtained for the cost function
and the execution time when using the dynamic programming algorithm and the
EA (best result of five runs, average and standard deviation).

Figure 7: Number of generations required to reach convergence with different
population sizes (a) and evolution of the cost function with the number of
generations (b). Results correspond to streams (a) and (b) of Figure [?] and
to stream (c) of Figure [?].
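
As a point of reference for the comparison in this section, a Viterbi-style
dynamic program for this kind of fitting problem can be sketched as below. The
sketch assumes a Kleinberg-style setup that is not spelled out in this
section: exponentially distributed gaps with one rate per state (rates scaled
geometrically by a hypothetical factor scale), a transition penalty charged
only when moving to a higher-frequency state (weighted by a hypothetical
gamma), and a start in state 0. All function names are illustrative.

```python
import math


def state_rates(gaps, n_states, scale=2.0):
    """Assumed rates: the stream's mean arrival rate times scale**j."""
    base = len(gaps) / sum(gaps)
    return [base * scale ** j for j in range(n_states)]


def emission_cost(alpha, gap):
    """Negative log-density of an exponential gap, -ln(alpha * exp(-alpha * gap))."""
    return alpha * gap - math.log(alpha)


def transition_cost(i, j, n_gaps, gamma=1.0):
    """Assumed penalty: only moves to a higher-frequency state are charged."""
    return gamma * (j - i) * math.log(n_gaps) if j > i else 0.0


def sequence_cost(states, gaps, alphas, gamma=1.0):
    """Total cost of a given state sequence, starting from state 0."""
    total, previous = 0.0, 0
    for state, gap in zip(states, gaps):
        total += transition_cost(previous, state, len(gaps), gamma)
        total += emission_cost(alphas[state], gap)
        previous = state
    return total


def viterbi(gaps, n_states, scale=2.0, gamma=1.0):
    """Return (minimum cost, state sequence) by dynamic programming."""
    alphas = state_rates(gaps, n_states, scale)
    n = len(gaps)
    # best[j]: cheapest cost of a sequence covering the gaps seen so far, ending in j
    best = [transition_cost(0, j, n, gamma) + emission_cost(alphas[j], gaps[0])
            for j in range(n_states)]
    back = []  # back[t][j]: best predecessor of state j at gap t + 1
    for gap in gaps[1:]:
        step, pointers = [], []
        for j in range(n_states):
            i = min(range(n_states),
                    key=lambda k: best[k] + transition_cost(k, j, n, gamma))
            pointers.append(i)
            step.append(best[i] + transition_cost(i, j, n, gamma)
                        + emission_cost(alphas[j], gap))
        best = step
        back.append(pointers)
    last = min(range(n_states), key=lambda k: best[k])
    path = [last]
    for pointers in reversed(back):   # follow the back-pointers to the first gap
        path.append(pointers[path[-1]])
    path.reverse()
    return best[last], path


# Toy stream: slow arrivals, a burst of short gaps, then slow arrivals again.
gaps = [10.0] * 5 + [1.0] * 10 + [10.0] * 5
cost, path = viterbi(gaps, n_states=3)
```

The sketch stores one back-pointer per gap and state, so its memory use grows
with the stream length times the number of states, and its running time with
the stream length times the square of the number of states.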


State n.   Viterbi               Evo. Alg.
           Ex. time   Cost       Ex. time   Cost     Av. Cost   Std. dev.
5          2394.44    734721     614.83     734721   735242.6   824.72
10         4789.13    704021     1925.61    704021   707342.2   3081.98
15         7188.22    699429     2688.49    699429   702288.4   2989.61
20         9588.8     696703     2745.35    696703   696703     0
25         12001.2    696210     1220.35    696210   696210     0

Table 2: Execution time in seconds and value of the cost function for the stream
of Figure [?](a) when applying the Viterbi and the Evolutionary algorithms to
find the optimal sequence of states in automata with different numbers of
states. The cost function used has been r g . The EA has been run with a
population size of 200 individuals, a maximum number of 200 iterations, a
crossover rate of 40% and a mutation rate of 5%. The values presented for the
EA are the best cost (and the running time associated with it), the average and
the standard deviation of five runs.

State n.   Viterbi               Evo. Alg.
           Ex. time   Cost       Ex. time   Cost     Av. Cost   Std. dev.
5          792.08     279464     632.63     279464   280897.6   839.70
10         1547.4     277796     1515.57    277893   278785.4   899.03
15         2319.36    277402     1678.61    277712   279385.6   980.11
20         3117.28    277306     2182.12    277528   278980.4   1114.91
25         3835.37    277260     2033.81    277270   279472.6   1116.03

Table 3: Execution time in seconds and value of the cost function for the stream
of Figure [?](b) when applying the Viterbi and the Evolutionary algorithms to
find the optimal sequence of states in automata with different numbers of
states. The cost function employed has been r g . The EA has been run with a
population size of 200 individuals, a maximum number of 200 iterations, a
crossover rate of 40% and a mutation rate of 5%. The values presented for the
EA are the best cost (and the running time associated with it), the average and
the standard deviation of five runs.


State n.   Viterbi               Evo. Alg.
           Ex. time   Cost       Ex. time   Cost     Av. Cost   Std. dev.
5          2377.35    795853     1719.27    795853   799612.8   3098.99
10         4781.91    794821     3353.18    794821   795938.2   1117.64
15         5666.24    794381     3984.38    794381   798642.0   3007.618
20         9559.78    794256     5804.92    794256   797285.0   3023.408
25         11975.5    794199     5550.31    795013   798818.8   3448.08

Table 4: Execution time in seconds and value of the cost function for the stream
of Figure [?] when applying the Viterbi and the Evolutionary algorithms to find
the optimal sequence of states in automata with different numbers of states.
The cost function employed has been r g . The EA has been run with a population
size of 200 individuals, a maximum number of 200 iterations, a crossover rate of
40% and a mutation rate of 5%. The values presented for the EA are the best
cost (and the running time associated with it), the average and the standard
deviation of five runs.

Figure 8: Steps of the fitting curve obtained with the Viterbi algorithm and the
Evolutionary algorithm for stream (b) of Figure [?] with a 10-state automaton.
The EA is run with a population size of 200 individuals for a maximum of 200
generations, with a crossover rate of 40% and a mutation rate of 5%.


It can be observed that the EA always yields the shortest running time. In
some cases both algorithms obtain the same value of the cost function for a
given number of states in the automaton; in other cases the value provided by
Viterbi is slightly better. However, in all these cases the fitting curve
obtained with both algorithms has the same shape and the only difference is
the absolute state number assigned to different steps; the relative change of
state, and therefore of frequency, is maintained. For example, for stream (b),
Figure 8 shows the results obtained with an automaton of 10 states with the
Viterbi algorithm and the EA. We can observe that the number of steps found in
each curve is the same, that the size of each step and the documents at which
it begins and ends are also the same for both algorithms, and that the
relative change of state is maintained. The only change, marked in the figure,
is the absolute state assigned to the interval between documents 60000 and
75000. Therefore, both solutions are equally useful to detect the bursts of
interest in the stream and to estimate the future trend. Furthermore, the
standard deviation of the cost, which is below 0.5% of the average cost in all
cases, shows that the EA is highly robust.
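
The robustness figure quoted above can be checked directly from the "Av. Cost"
and "Std. dev." columns of Tables 2, 3 and 4. The short script below is only
an arithmetic check; the names ea_runs and relative_std are introduced here
for illustration.

```python
# (average cost, standard deviation) of the EA over five runs,
# copied from Tables 2, 3 and 4 for automata of 5, 10, 15, 20 and 25 states.
ea_runs = {
    "Table 2": [(735242.6, 824.72), (707342.2, 3081.98), (702288.4, 2989.61),
                (696703.0, 0.0), (696210.0, 0.0)],
    "Table 3": [(280897.6, 839.70), (278785.4, 899.03), (279385.6, 980.11),
                (278980.4, 1114.91), (279472.6, 1116.03)],
    "Table 4": [(799612.8, 3098.99), (795938.2, 1117.64), (798642.0, 3007.618),
                (797285.0, 3023.408), (798818.8, 3448.08)],
}


def relative_std(average, std):
    """Standard deviation as a percentage of the average cost."""
    return 100.0 * std / average


# Largest relative deviation over all tables and automaton sizes.
worst = max(relative_std(avg, std)
            for rows in ea_runs.values()
            for avg, std in rows)
```

The largest relative deviation is about 0.44% (Table 2, 10 states), which is
consistent with the bound stated in the text.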

4.4 Evaluating Real World Document Streams 

Finally, we have studied the behavior of some streams obtained from the real
world. They have been obtained from the Blogalia weblog hosting site¹ by
doing a database search on certain words: (a) rss, (b) gmail, (c) terrorismo²
and (d) atentado terrorista³. Figure 9 shows the fitting curves obtained by
the EA with a population size of 200 individuals, a maximum number of 1000
iterations, a crossover rate of 40%, a mutation rate of 5% and an automaton of
25 states. We can observe that the fit for rss (Figure 9(a)) shows many peaks,
although they are mainly concentrated in the second part of the considered
period of time. This behavior can be expected for a stream on a topic in which
interest is highly variable. The curve for the gmail stream (Figure 9(b))
presents clear bursts in two particular intervals of time approximately in
the middle of the considered period, which roughly correspond to new waves
of GMail (the Google webmail system, at http://gmail.com) invitations. The
last two fits (Figure 9(c) and (d)) are devoted to two related topics,
terrorism and terrorist attack, and thus we can observe that, although the
intensity of document arrivals is greater in (d), because the topic is more
general, the bursts in both streams appear at approximately the same times, as
can be expected.

¹ http://www.blogalia.com
² terrorism


5 Dynamic Detection of Changes of Interest in 
Document Streams 

We have also investigated the dynamic detection of changes in the trends of
document streams. Let us assume that the fit for a stream has been found, new
documents arrive, and we want to detect possible changes in the trends of the
corresponding topic. Clearly, the most accurate way of doing this is to
reapply the algorithm to find the best fit for the extended stream. However,
with real-time applications for the web in mind, we have investigated ways to
fit just one new document, without refitting the whole stream until there is
time to do so. This can be done by searching for the minimum cost for the new
gap x given the last state of the previous fit. Thus, if the fit obtained for
the previous stream finished at state q_i, we look for the state q_j which
minimizes the cost


$$\arg\min_{q_j} P(q_j \mid q_i, x) = \arg\min_{q_j} \left( T(q_i, q_j) - \ln\left( \alpha_j e^{-\alpha_j x} \right) \right)$$

In order to check whether this “local” approximation is meaningful, we have
chosen one real world stream, the one tracking the term gmail, and have
performed some cuts on it. Table 5 shows the results for some of these cuts.
This mechanism, which allows us to detect changes in the trends of a stream
immediately, has worked properly in all these cases, yielding the state that
had previously been found by the fitting algorithms.
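
The local approximation can be sketched as a single minimization over the
states. The sketch below is hedged: it assumes exponentially distributed gaps
with one rate per state, so that the fitting term for a gap is the negative
log-density, and it takes the transition cost as a user-supplied function. The
names local_state and upward_penalty, the rates and the penalty weights are
illustrative assumptions, not values from the experiments.

```python
import math


def local_state(prev_state, gap, alphas, transition_cost):
    """Pick the state with minimum cost for one new gap.

    The cost of state j is transition_cost(prev_state, j) plus the negative
    log-density of the gap under an exponential with rate alphas[j].
    """
    def cost(j):
        return transition_cost(prev_state, j) + alphas[j] * gap - math.log(alphas[j])

    return min(range(len(alphas)), key=cost)


def upward_penalty(i, j, gamma=0.1, n_gaps=100):
    """Assumed penalty: only changes to a higher-frequency state are charged."""
    return gamma * (j - i) * math.log(n_gaps) if j > i else 0.0


# Hypothetical rates for a 5-state automaton (documents per time unit).
alphas = [0.1 * 2 ** j for j in range(5)]

after_short_gap = local_state(3, 0.0, alphas, upward_penalty)   # -> 4
after_long_gap = local_state(4, 30.0, alphas, upward_penalty)   # -> 0
```

With these illustrative values, a very short gap moves the fit one state up
and a long gap drops it back to the lowest state, which is the kind of trend
change the mechanism is meant to flag.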

³ terrorist attack

Figure 9: Fitting curves for some real world document streams; each stream
corresponds to blog posts from Blogalia that include the following words: (a)
RSS; (b) gmail; (c) atentado; and (d) atentado terrorista.

Previous substream       Arrival Time   Old state   New state   Trend
... 38 38 39 41 49 49    52             12          0           ↓
... 41 49 49 52 68 69    69             3           4           ↑
... 88 89 90 90 91 92    95             0           0           -

Table 5: Results of applying a local approximation to detect changes in the
trend of a stream.

Another improvement to quickly obtain a fit for an extension of a previously
fitted stream is to use the previous fitting curve as a seed for the EA which
searches the fit of the extended stream. To implement this mechanism, we have
modified the way in which the EA creates the initial population. A seed
individual is created, its last gene extended with the new substream of
document dates. Then, the initial population is created by applying the
mutation operator to this seed individual, but in such a way that the last
gene has a higher probability to undergo mutation. Once the initial population
has been created, the EA proceeds as before.
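
The seeding step can be sketched as follows. This is a rough illustration
under stated assumptions, not the authors' code: an individual is taken to be
a list of genes of the form [state, number of documents], the mutation
operator is simplified to shifting a gene's state by one, and the names
extend_seed, mutate and seeded_population as well as the probabilities rate
and last_rate are hypothetical.

```python
import random


def extend_seed(previous_fit, n_new):
    """Copy the previous fit and lengthen its last gene by n_new documents."""
    seed = [list(gene) for gene in previous_fit]
    seed[-1][1] += n_new
    return seed


def mutate(individual, n_states, rng, rate, last_rate):
    """Shift each gene's state by +/-1 with probability `rate`.

    The last gene uses `last_rate` instead, so it mutates more often.
    """
    child = [list(gene) for gene in individual]
    for k, gene in enumerate(child):
        p = last_rate if k == len(child) - 1 else rate
        if rng.random() < p:
            shifted = gene[0] + rng.choice((-1, 1))
            gene[0] = min(n_states - 1, max(0, shifted))  # stay within valid states
    return child


def seeded_population(previous_fit, n_new, size, n_states, rng,
                      rate=0.05, last_rate=0.5):
    """Initial population: the extended seed plus mutated copies of it."""
    seed = extend_seed(previous_fit, n_new)
    return [seed] + [mutate(seed, n_states, rng, rate, last_rate)
                     for _ in range(size - 1)]


# Toy previous fit: 50 documents in state 2, 20 in state 3, 30 in state 2.
rng = random.Random(0)
previous_fit = [[2, 50], [3, 20], [2, 30]]
population = seeded_population(previous_fit, n_new=10, size=200,
                               n_states=5, rng=rng)
```

Every individual in this population already covers the extended stream, so the
search starts close to the previous fit and mostly explores the state of the
newly extended interval.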

Experiments have shown that this mechanism can save a lot of execution
time. Table 6 shows the time required to fit the stream of Figure [?](c) if a
part of it, whose length appears in the first column, has already been fitted.
We can observe that the time required with the seed is very small: under 170
seconds in all cases, compared with 3895.28 seconds without it.


Subst. Len.   New Subs. len.   T. w/out seed   T. w/ seed
219900        100              3895.28         141.45 (79.09)
219000        1000             3895.28         144.75 (81.96)
220000        10000            3895.28         166.73 (79.32)

Table 6: Time (seconds) spent without seed and using the previous fit as a seed
for some substreams taken from the stream of Figure [?](c). The first column
indicates the length of the previously fitted substream, and the second one the
length of the stream of documents which has to be added to the previous one.
The third column presents the time needed to fit the whole stream, while the
last one presents the time spent to reach convergence by using the previous fit
as seed, with a population size of 200 individuals, a crossover rate of 40% and
a mutation rate of 10%. The result for a population size of 100 individuals
appears in parentheses.


6 Conclusions and future work 

In this paper, we have presented the design of a system devoted to the dynamic
detection of changes in the trends of the topics of a stream of documents, such
as newscasts, e-mails, IRC conversations, scientific journals or weblogs. It is
based on modeling the assignment of frequencies to intervals of document
arrivals and obtaining an optimal fit to the data. We have designed an
evolutionary algorithm to implement the model, which allows us to deal with
very large sequences of documents in a reasonable time, obtaining fitting
curves with a similar shape to those provided by classic dynamic programming
algorithms. We have studied different issues of the model and its
implementation, such as the criterion to change to an interval with a new
frequency. We have found that penalizing the change of an interval to another
with a different frequency, whether it is higher or lower, provides worse
results than penalizing only those changes to an interval of higher frequency.
We have also tested different functions for this last type of penalization,
and found that the results depend weakly on this function. Because of this, we
prefer a function which normalizes the penalization with respect to the number
of frequency values considered, in order to make the results independent of
this parameter.

We have also designed a version of the evolutionary algorithm which
dramatically reduces the time required to find the optimal fit to a stream
which is an extension of a previously fitted substream. This version of the
evolutionary algorithm uses the previous fit as a seed to generate the initial
population, which can quickly converge if most of the stream has been
previously fitted. In this way, our system can be applied to dynamically model
the document stream, and thus detect changes in the trends of the corresponding
topic in real time. In addition, the fitting curves produced by the system for
a stream of documents can also be useful for other applications: the fits
obtained for streams corresponding to different topics can help to detect
correlations between these topics, to study how a topic affects others, etc.

Future lines of work include:

1. An exhaustive study of the evolutionary algorithm parameters, to obtain
better performance.

2. Study of correlation among document streams, to automatically detect
the occurrence of new topics composed of multi-word concepts. This can
also be helped by other techniques.

3. Optimization for real-time operation, including parallelization of the
algorithms.

4. More extensive checking on many different document streams.

5. Characterization of document streams via their models.